tidyrMessy (3): Multiple observational units are stored in the same table.
What does that mean? The key is split, i.e. for some values all key variables are necessary, while other values only need some key variables.
Splitting into separate datasets:
Messy (4): Values for a single observational unit are stored across multiple tables.
After data are normalized by splitting, we want to de-normalize again by joining datasets.
Sean Lahman is a database journalist, who started databases of historical sports statistics, in particular, the Lahman database on baseball.
library(Lahman)
LahmanData## file class nobs nvar title
## 1 AllstarFull data.frame 4993 8 AllstarFull table
## 2 Appearances data.frame 99466 21 Appearances table
## 3 AwardsManagers data.frame 171 6 AwardsManagers table
## 4 AwardsPlayers data.frame 6026 6 AwardsPlayers table
## 5 AwardsShareManagers data.frame 401 7 AwardsShareManagers table
## 6 AwardsSharePlayers data.frame 6705 7 AwardsSharePlayers table
## 7 Batting data.frame 99846 22 Batting table
## 8 BattingPost data.frame 11294 22 BattingPost table
## 9 CollegePlaying data.frame 17350 3 CollegePlaying table
## 10 Fielding data.frame 167938 18 Fielding table
## 11 FieldingOF data.frame 12028 6 FieldingOF table
## 12 FieldingPost data.frame 11924 17 FieldingPost data
## 13 HallOfFame data.frame 4088 9 Hall of Fame Voting Data
## 14 Managers data.frame 3370 10 Managers table
## 15 ManagersHalf data.frame 93 10 ManagersHalf table
## 16 Master data.frame 18589 26 Master table
## 17 Pitching data.frame 43330 30 Pitching table
## 18 PitchingPost data.frame 4945 30 PitchingPost table
## 19 Salaries data.frame 24758 5 Salaries table
## 20 Schools data.frame 1207 5 Schools table
## 21 SeriesPost data.frame 298 9 SeriesPost table
## 22 Teams data.frame 2775 48 Teams table
## 23 TeamsFranchises data.frame 120 4 TeamFranchises table
## 24 TeamsHalf data.frame 52 10 TeamsHalf table
The Lahman database consists of 24 data frames that are linked by playerID.
This is clean, but not very readable.
The Master table includes names and other attributes for each player.
Joining multiple tables helps us to bring together (pieces of) information from multiple sources.
HallOfFame <- HallOfFame %>% group_by(playerID) %>%
mutate(times = order(yearID))
HallOfFame %>%
ggplot(aes(x = yearID, y = votes/needed, colour = inducted)) +
geom_hline(yintercept = 1, colour = "grey20", size = .1) +
geom_line(aes(group = playerID), colour = "black", size = 0.2) +
geom_point() We’d like to label all the last attempts - and not just with the playerID
HallOfFame %>%
ggplot(aes(x = times, y = votes/needed, colour = inducted)) +
geom_hline(yintercept = 1, colour = "grey20", size = .1) +
geom_line(aes(group = playerID), colour = "black", size = 0.2) +
geom_point() dplyr there are various join functions: left_join, inner_join, full_join, …df1 <- data.frame(id = 1:6, trt = rep(c("A", "B", "C"), rep=c(2,1,3)), value = c(5,3,7,1,2,3))
df1## id trt value
## 1 1 A 5
## 2 2 B 3
## 3 3 C 7
## 4 4 A 1
## 5 5 B 2
## 6 6 C 3
df2 <- data.frame(id=c(4,4,5,5,7,7), stress=rep(c(0,1), 3), bpm = c(65, 125, 74, 136, 48, 110))
df2## id stress bpm
## 1 4 0 65
## 2 4 1 125
## 3 5 0 74
## 4 5 1 136
## 5 7 0 48
## 6 7 1 110
NAright_join works symmetricleft_join(df1, df2, by="id")## id trt value stress bpm
## 1 1 A 5 NA NA
## 2 2 B 3 NA NA
## 3 3 C 7 NA NA
## 4 4 A 1 0 65
## 5 4 A 1 1 125
## 6 5 B 2 0 74
## 7 5 B 2 1 136
## 8 6 C 3 NA NA
inner_join(df1, df2, by = "id")## id trt value stress bpm
## 1 4 A 1 0 65
## 2 4 A 1 1 125
## 3 5 B 2 0 74
## 4 5 B 2 1 136
NAfull_join(df1, df2, by = "id")## id trt value stress bpm
## 1 1 A 5 NA NA
## 2 2 B 3 NA NA
## 3 3 C 7 NA NA
## 4 4 A 1 0 65
## 5 4 A 1 1 125
## 6 5 B 2 0 74
## 7 5 B 2 1 136
## 8 6 C 3 NA NA
## 9 7 <NA> NA 0 48
## 10 7 <NA> NA 1 110
by = c("ID", "Date")by = c("State" = "Name")anti_joindplyranti_join(df1, df2, by="id") # no values for id in df2## id trt value
## 1 6 C 3
## 2 3 C 7
## 3 2 B 3
## 4 1 A 5
anti_join(df2, df1, by="id") # no values for id in df1## id stress bpm
## 1 7 0 48
## 2 7 1 110
Does lifetime batting average make a difference in a player being inducted?
Batting2 <- Batting %>% group_by(playerID) %>%
mutate(BatAvg = H/AB) %>%
summarise(LifeBA = mean(BatAvg, na.rm=TRUE))
hof_bats <- inner_join(HallOfFame %>% filter(category == "Player"), Batting2,
by = c("playerID"))
hof_bats %>%
ggplot(aes(x = yearID, y = LifeBA, group = playerID)) +
geom_point(aes(color = inducted))What about pitchers? Are pitchers with lower lifetime ERAs more likely to be inducted?
Pitching2 <- Pitching %>% group_by(playerID) %>%
summarise(LifeERA = mean(ERA, na.rm = TRUE))
hof_pitch <- inner_join(HallOfFame %>% filter(category == "Player"), Pitching2,
by = c("playerID"))
hof_pitch %>%
ggplot(aes(x = yearID, y = LifeERA, group = playerID)) +
geom_point(aes(color = inducted))## Warning: Removed 2 rows containing missing values (geom_point).
Lahman package into your R session.Master data set and the HallOfFame data.ggplot2 chart label all last attempts of individuals with 15 or more attempts. Make sure to deal with missing values appropriately.Not all organizations involved in inducting worthy candidates into the Hall of Fame take votes.
We will take only voters into consideration:
Voted <- HallOfFame %>% filter(!is.na(ballots)) %>%
group_by(playerID) %>% arrange(yearID) %>%
mutate( attempt = order(yearID))
dim(Voted)## [1] 3927 11
Voted <- left_join(Voted, Master %>% select(playerID, nameFirst, nameLast),
by = "playerID")
dim(Voted)## [1] 3927 13
Voted %>% ggplot(aes(x = attempt, y = votes/needed)) +
geom_hline(yintercept = 1, colour = "grey25", size = 0.1) +
geom_line(aes(group = playerID), colour = "grey35", size = 0.2) +
geom_point(aes(colour = inducted))labels <- Voted %>% group_by(playerID) %>% summarize(
votes = votes[which.max(attempt)],
needed = needed[which.max(attempt)],
attempt = max(attempt),
name = paste(unique(nameFirst), unique(nameLast))
)